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Abstract 

A  family  of  regularized  least  squares  regression  models  in  a  Reproducing  Kernel  Hilbert 
Space  is  extended  by  the  kernel  partial  least  squares  (PLS)  regression  model.  Similar 
to  principal  components  regression  (PCR),  PLS  is  a  method  based  on  the  projection  of 
input  (explanatory)  variables  to  the  latent  variables  (components).  However,  in  contrast 
to  PCR,  PLS  creates  the  components  by  modeling  the  relationship  between  input  and 
output  variables  while  maintaining  most  of  the  information  in  the  input  variables.  PLS 
is  useful  in  situations  where  the  number  of  explanatory  variables  exceeds  the  number  of 
observations  and/or  a  high  level  of  multicollinearity  among  those  variables  is  assumed. 
Motivated  by  this  fact  we  will  provide  a  kernel  PLS  algorithm  for  construction  of  nonlinear 
regression  models  in  possibly  high-dimensional  feature  spaces. 

We  give  the  theoretical  description  of  the  kernel  PLS  algorithm  and  we  experimentally 
compare  the  algorithm  with  the  existing  kernel  PCR  and  kernel  ridge  regression  techniques. 

We  will  demonstrate  that  on  the  data  sets  employed  kernel  PLS  achieves  the  same  results 
as  kernel  PCR  but  uses  significantly  fewer,  qualitatively  different  components. 

1.  Introduction 

In  this  paper  we  will  focus  our  attention  on  least  squares  regression  models  in  a  Reproducing 
Kernel  Hilbert  Space  (RKHS).  The  models  are  derived  based  on  a  straightforward  connec¬ 
tion  between  a  RKHS  and  the  corresponding  feature  space  representation  where  the  input 
data  are  mapped.  In  our  previous  work  (Rosipal  et  ah,  2000a,  2001,  2000b)  we  proposed  the 
kernel  principal  components  regression  (PCR)  technique  and  we  also  made  theoretical  and 
experimental  comparison  to  kernel  ridge  regression  (RR)  (Saunders  et  ah,  1998,  Cristianini 
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and  Shawe-Taylor,  2000).  In  this  work  we  extend  the  family  with  a  new  nonlinear  kernel 
partial  least  squares  (PLS)  regression  method. 

Classical  PCR,  PLS  and  RR  techniques  are  well  known  shrinkage  estimators  designed 
to  deal  with  multicollinearity  (see,  e.g.,  Frank  and  Friedman,  1993,  Montgomery  and  Peck, 
1992,  Jolliffe,  1986).  The  multicollinearity  or  near-linear  dependence  of  regressors  is  a 
serious  problem  which  can  dramatically  influence  the  effectiveness  of  a  regression  model. 
Multicollinearity  results  in  large  variances  and  covariances  for  the  least  squares  estimators 
of  the  regression  coefficients.  Multicollinearity  can  also  produce  estimates  of  the  regression 
coefficients  that  are  too  large  in  absolute  value.  Thus  the  values  and  signs  of  estimated 
regression  coefficients  may  change  considerably  given  different  data  samples.  This  effect 
can  lead  to  a  regression  model  which  fits  the  training  data  reasonably  well,  but  generalizes 
poorly  to  new  data  (Montgomery  and  Peck,  1992).  This  fact  is  in  a  very  close  relation  to  the 
argument  stressed  in  (Smola  et  ah,  1998),  where  the  authors  have  shown  that  choosing  the 
flattest  linear  regression  function^  in  a  feature  space  can,  based  on  the  smoothing  properties 
of  the  selected  kernel  function,  lead  to  a  smooth  nonlinear  function  in  the  input  space. 

The  PLS  method  (Wold,  1975,  Wold  et  ah,  1984)  has  been  a  popular  regression  tech¬ 
niques  in  its  domain  of  origin — Chemometrics.  The  method  is  similar  to  PCR  where  prin¬ 
cipal  components  determined  solely  from  explanatory  variables  creates  orthogonal,  i.e.  un¬ 
correlated,  input  variables  in  a  regression  model.  In  contrast,  PLS  creates  orthogonal  com¬ 
ponents  by  using  the  existing  correlations  between  explanatory  variables  and  corresponding 
outputs  while  also  keeping  most  of  the  variance  of  explanatory  variables.  PLS  has  proven 
to  be  useful  in  situations  when  the  number  of  observed  variables  (iV)  is  significantly  greater 
than  the  number  of  observations  (n)  and  high  multicollinearity  among  the  variables  exists. 
This  situation  when  N  ^  nis  common  in  chemometrics  and  gave  rise  to  the  modification  of 
classical  principal  component  analysis  (PCA)  and  linear  PLS  methods  to  their  kernel  vari¬ 
ants  (Wu  et  ah,  1997a,  Rannar  et  ah,  1994,  Lewi,  1995).  However,  rather  than  assuming 
a  nonlinear  transformation  into  a  feature  space  of  arbitrary  dimensionality  the  authors  at¬ 
tempted  to  reduce  computational  complexity  in  the  input  space.  Motivated  by  these  works 
we  propose  a  more  general  nonlinear  kernel  PLS  algorithm.^ 

There  exist  several  nonlinear  versions  of  the  PLS  model.  In  (Frank,  1990,  1994)  ap¬ 
proaches  based  on  fitting  the  nonlinear  input-output  dependency  by  providing  the  extracted 
components  as  inputs  to  smoothers  and  spline-based  additive  nonlinear  regression  models 
were  proposed.  Another  nonlinear  PLS  model  (Malthouse,  1995,  Malthouse  et  ah,  1997)  is 
based  on  relatively  complicated  artificial  neural  network  modeling  of  nonlinear  PCA  and 
consequent  nonlinear  PLS.  From  that  point  our  approach  differs  in  the  sense  that  the  orig¬ 
inal  input  data  are  nonlinearly  mapped  to  a  feature  space  T  where  a  linear  PLS  model 
is  created.  Good  generalization  properties  of  the  corresponding  nonlinear  PLS  model  are 
then  achieved  by  appropriate  estimation  of  regression  coefficients  in  T  and  by  the  selection 
of  an  appropriate  kernel  function.  Moreover,  utilizing  the  kernel  function  corresponding  to 
the  canonical  dot  product  in  T  allows  us  to  avoid  the  nonlinear  optimization  involved  in 
the  above  approaches.  In  fact  only  linear  algebra  as  simple  as  in  a  linear  PLS  regression  is 
required. 

1.  The  flatness  is  defined  in  the  sense  of  penalizing  high  values  of  the  regression  coefficients  estimate. 

2.  In  the  following,  where  it  is  clear,  we  will  not  stress  this  nonlinear  essence  of  the  proposed  kernel  PLS 
regression  model  and  will  use  the  kernel  PLS  notation. 
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In  Section  2  a  basic  definition  of  a  RKHS  and  formulation  of  the  Representer  theorem 
is  given.  Section  3  describes  the  “classical”  PLS  algorithm.  In  Section  4  the  kernel  PLS 
method  is  given.  Some  of  the  properties  of  kernel  PLS  are  shown  using  a  simple  example. 
Kernel  PCR  and  kernel  RR  are  briefly  described  in  Section  5.  Section  6  describes  the  used 
model  selection  techniques.  The  results  are  given  in  Section  7.  Section  8  provides  a  short 
discussion  and  concludes  the  paper. 

2.  RKHS  and  Representer  Theorem 

The  common  aim  of  support  vector  machines,  regularization  networks,  Gaussian  processes 
and  spline  methods  (Vapnik,  1998,  Girosi,  1998,  Williams,  1998,  Wahba,  1990,  Gristianini 
and  Shawe-Taylor,  2000)  is  to  address  the  poor  generalization  properties  of  existing  nonlin¬ 
ear  regression  techniques.  To  overcome  this  problem  a  regularized  formulation  of  regression 
is  considered  as  a  variational  problem  in  a  RKHS  7i 

1  ” 

minR^ea(/)  =  -  '^V{yiJ{^i))  +  i\\f\\l^  •  (1) 

/Grt  n  — f 

1=1 

We  assume  a  training  set  of  regressors  to  be  a  subset  of  a  compact  set  X  C 

and  G  R  to  be  a  set  of  corresponding  outputs.  The  solution  to  the  problem  (1)  was 

given  by  Kimeldorf  and  Wahba  (1971),  Wahba  (1999)  and  is  known  as  the 

Representer  theorem  (simple  case):  Let  the  loss  function  V{yi,f)  be  a  functional  of  / 
which  depends  on  /  only  pointwise,  that  is,  through  {/(xj)})L^ — the  values  of  /  at 
the  data  points.  Then  any  solution  to  the  problem:  find  f  £  7i  to  minimize  (1)  has  a 
representation  of  the  form 

n 

f{^)  =  '^CiK{xi,x)  ,  (2) 

i=l 

where  G  R. 

In  this  formulation  ^  is  a  positive  number  (regularization  coefficient)  to  control  the  tradeoff 
between  approximating  properties  and  the  smoothness  of  /.  ||/|||^  is  a  norm  (sometimes 
called  “stabilizer”  in  the  regularization  networks  domain)  in  a  RKHS  7i  defined  by  the 
positive  definite  kernel  K(x,y);  i.e.  a  symmetric  function  of  two  variables  satisfying  the 
Mercer  theorem  conditions  (Mercer,  1909,  Gristianini  and  Shawe-Taylor,  2000).  The  fact 
that  for  any  such  positive  definite  kernel  there  exists  a  unique  RKHS  is  well  established 
by  the  Moore-Aronszjan  theorem  (Aronszajn,  1950).  The  form  K(x,  y)  has  the  following 
reprodueing  property 

/(y)  =  (/(x), K(x, y))^^  V/  G  R  , 

where  (.,  .)-^  is  the  scalar  product  in  7i.  The  function  K  is  called  a  reproducing  kernel  for  7i. 

It  follows  from  Mercer’s  theorem  that  each  positive  definite  kernel  iL(x,  y)  defined  on  a 
compact  domain  X  x  X  can  by  written  in  the  form 

M 

K(x,y)  =  ^  Ai(/)j(x)(/)j(y)  M  <  oo  ,  (3) 

i=l 
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where  are  the  eigenfunctions  of  the  integral  operator  Vk  '■  L2{X)  L2{X) 


(rA-/)(x)  =  /  K{yL,y)f{y)dy  \/f  £  L2{X) 


lx 


and  {Aj  >  0}^^  are  the  corresponding  positive  eigenvalues.  The  sequence  creates 

an  orthonormal  basis  of  TL  and  we  can  express  any  function  f  £  7i  as  /(x)  =  <2i</>i(x) 


for  some  a*  G  R.  This  allows  us  to  define  a  scalar  product  in  Tt: 

MM  M 


CLibi 


2=1 


2=1 


2  =  1 


and  the  norm  ||/||^ 


cij 

—  Z^2=l  A“- 
Rewriting  (3)  in  the  form 


M 


K{yi,y)  =  ^  ^/\i4>i{yL)^/\i(t)i{y)  =  (^>(x).d>(y))  =  ^>(x)^^>(y)  , 


2=1 


it  becomes  clear  that  any  kernel  K{x,y)  also  corresponds  to  a  canonical  (Euclidean)  dot 
product  in  a  possibly  high-dimensional  space  T  where  the  input  data  are  mapped  by 


d>  :  X 

X  ^  (V^</>l(x),  \/A2(/>2(x),  .  .  .  ,  '/X^<Pm{x.))  . 

The  space  T  is  usually  denoted  as  a  feature  spaee  and  x  G  X}  as  feature 

mappings.  The  number  of  basis  functions  (pi{.)  also  defines  the  dimensionality  of  J-.  It  is 
worth  noting,  that  we  can  also  construct  a  RKHS  and  a  corresponding  feature  space  by 
choosing  a  sequence  of  linearly  independent  functions  (not  necessary  orthogonal)  {V'j(x)}(^^ 
and  positive  numbers  ai  to  define  a  series  (in  the  case  of  M  =  oo  absolutely  and  uni¬ 
formly  convergent)  K{'x.,y)  =  ®*V'i(x)V'j(y).  This  also  gives  the  connection  between 

the  RKHS  and  Gaussian  processes  where  the  K  is  assumed  to  represent  the  correlation 
function  of  a  zero-mean  Gaussian  process  evaluated  at  points  x  and  y  (Wahba,  1990). 

Until  now,  we  assumed  that  K  is  a  positive  definite  kernel.  However,  the  above  results 
can  be  extended  even  for  the  case  when  K  is  a  positive  semidefinite.  Is  such  a  case  a  RKHS 
TL  contains  a  subspace  of  functions  /  with  a  zero  norm  ||/|||^  (the  null  space).  Kimeldorf 
and  Wahba  (1971)  showed  that  in  such  a  case  the  solution  of  (1)  leads  to  a  more  general 
form  of  the  Representer  theorem: 


n  I 

/(x)  =  x)  -F  ^  bjVjix.)  , 

i=l  j=l 

where  the  functions  {uj(.)}j'^^  span  the  null  space  of  TL  and  the  coefficients  {ci}(T^, 

are  again  given  by  the  data.  In  this  paper  we  will  consider  only  the  case  when  I  =  1  and 

ui(x)  =  const  V  X. 
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3.  Partial  Least  Squares  Regression 

PLS  regression  is  a  technique  for  modeling  a  linear  relationship  between  a  set  of  output 
variables  (responses)  {yj}(Li  ^  and  a  set  of  input  variables  (regressors)  G  .  In 

the  first  step,  PLS  creates  uncorrelated  latent  variables  which  are  linear  combinations  of  the 
original  regressors.  The  basic  point  of  the  procedure  is  that  the  weights  used  to  determine 
these  linear  combinations  of  the  original  regressors  are  proportional  to  the  covariance  among 
input  and  output  variables  (Helland,  1988).  A  least  squares  regression  is  then  performed  on 
the  subset  of  extracted  latent  variables.  This  leads  to  a  biased  but  lower  variance  estimate 
of  the  regression  coefficients  comparing  to  the  Ordinary  Least  Squares  (OLS)  regression. 

In  the  following  X  will  represent  the  (n  x  N)  matrix  of  n  inputs  and  Y  will  stand  for  the 
(n  X  L)  matrix  of  the  corresponding  L-dimensional  responses.  Further  we  assume  centered 
input  and  output  variables;  i.e.  the  columns  of  X  and  Y  are  zero  mean. 

There  exist  several  different  modifications  (see  Martens  and  Naes,  1989,  Manne,  1987, 
Helland,  1988,  de  Jong,  1993)  of  the  basic  algorithm  for  PLS  regression  originally  developed 
by  Wold  (1975).  In  its  basic  form  a  special  case  of  the  nonlinear  iterative  partial  least 
squares  (NIPALS)  algorithm  (Wold,  1966)  is  used.  NIPALS  is  a  robust  procedure  for 
solving  singular  value  decomposition  problems  and  is  closely  related  to  the  power  method 
(Golub  and  van  Loan,  1996).  After  an  initial  random  estimate  of  the  latent  vector  t  the 
following  two  steps  are  repeated  until  convergence  of  t  and  the  loadings  vector  p: 

1.  p  =  X^t 

2.  t  =  Xp,  t  <—  t/||t||  . 

After  the  extraction  of  t  and  p  vectors  the  matrix  X  is  deflated  by  t 

X  ^  X  -  tt'^X 

and  by  repeating  the  whole  procedure  we  may  extract  a  new  pair  of  vectors  t  and  p  which  are 
by  construction  orthogonal  to  the  previous  one.  It  is  worth  noting  that  in  the  case  that  N  < 
n  the  normalization  of  the  X-dimensional  vector  p  after  the  first  step  is  computationally 
advantageous  in  comparison  to  the  normalization  of  the  n-dimensional  vector  t.  However, 
the  normalization  of  t  allows  us  to  adapt  the  NIPALS  algorithm  to  extract  the  latent  vectors 
from  the  kernel  matrices  XX^  (Lewi,  1995): 

1.  p  =  XX'^t 

2.  t  =  p,  t  ^  t/||t||  . 

The  deflation  of  the  XX^  matrix  is  given  by 

XX^  ^  (X  -  tt'^X)(X  -  tt'^X)^  . 

Wold  et  al.  (1984)  applied  the  NIPALS  algorithm  to  the  PLS  regression  with  the  aim  to 
sequentially  extract  the  latent  vectors  t,  u  and  weight  vectors  w,  c  from  X  and  Y  matrices 
in  decreasing  order  of  their  corresponding  singular  values.  What  follows  is  a  modification  of 
the  “classical”  NIPALS-PLS  algorithm  in  the  sense  that  normalization  of  the  latent  vectors 
t,  u  rather  than  normalization  of  the  vectors  of  weights  w,  c  is  used  (Lewi,  1995): 
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1.  randomly  initialize  u 

2.  w  = 

3.  t  =  Xw,  t  <—  t/||t|| 

4.  c  =  Y'^t 

5.  u  =  Yc,  u  <—  u/||u|| 

6.  repeat  steps  2.-5.  until  convergence 

7.  deflate  X,Y  matrices:  X  ^  X  -  tt'^X,  Y  ^  Y  -  tt^Y  . 

The  PLS  regression  is  an  iterative  process;  i.e.  after  extraction  of  one  component  the 
algorithm  starts  again  using  the  deflated  matrices  X  and  Y  computed  in  step  7.  Thus  we 
can  achieve  the  sequence  of  the  models  up  to  the  point  when  the  rank  of  X  is  reached. 
However,  in  practice  the  cross-validation  technique  is  usually  used  to  avoid  underfitting 
or  overfitting  caused  by  the  use  of  too  small  or  too  large  dimensional  models.  After  the 
extraction  of  the  p  components  we  can  create  the  (n  x  p)  matrices  T,  U,  the  {N  x  p)  matrix 
W  and  the  (L  x  p)  matrix  C  consisting  of  the  columns  created  by  the  vectors 

and  respectively,  extracted  during  the  individual  iterations. 

The  PLS  regression  model  can  be  written  in  matrix  form  as  (Manne,  1987,  Rannar  et  ah, 
1994) 

Y  =  XB  +  F  , 

where  B  is  an  (N  x  L)  matrix  of  the  of  the  regression  coefficients  and  F  is  an  (n  x  L)  matrix 
of  residuals.  This  equation  is  identical  to  that  used  in  other  regression  models;  multiple 
linear  regression,  ridge  regression  and  principal  components  regression,  however,  in  contrast 
to  these  models  the  matrix  B  has  the  form  (Manne,  1987,  Rannar  et  ah,  1994) 

B  =  W(P^W)“iC^  ,  (4) 

where  P  is  the  {N  x  p)  matrix  consisting  of  loadings  vectors  {p*  =  X'^ti/(t^tj)}h^^.  Due 
to  the  fact  that  pfwj  =  0  for  f  >  j  and  in  general  pfwj  /  0  for  i  <  j  (Hoskuldsson, 
1988)  the  matrix  P^W  is  upper  triangular  and  thus  invertible.  Moreover,  using  the  fact 
that  tftj  =  0  for  i  ^  j  and  t^Uj  =  0  for  j  >  i  Rannar  et  al.  (1994)  derived  the  following 
equalities  ^ 

W  =  X'^U  (5) 

P  =  X'^T(T'^T)~i  (6) 

C  =  Y'^T(T'^T)-i  .  (7) 

Substituting  (5-7)  into  (4)  and  using  the  orthogonality  of  the  T  matrix  columns  we  can 
write  the  matrix  B  in  the  following  form 

B  =  X^U(T^XX'^U)”^T'^Y  .  (8) 

It  is  worth  noting  that  different  scalings  of  the  individual  latent  vectors  and 

do  not  influence  this  estimate  of  the  matrix  B. 

3.  In  our  case  T^T  is  the  p-dimensional  identity  matrix.  This  is  simply  a  consequence  of  normalization  of 
the  individual  latent  vectors 
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4.  Kernel  Partial  Least  Squares  Regression  in  RKHS 

Assume  a  nonlinear  transformation  of  the  input  variables  into  a  feature  space  i.e. 

mapping  :  x*  G  ‘h(xj)  G  T .  Our  goal  is  to  construct  a  linear  PLS  regression  model 

in  T .  Effectively  it  means  that  we  can  obtain  a  nonlinear  regression  model  in  the  space 
of  the  original  input  variables.  Denote  by  $  an  (n  x  M)  matrix  of  regressors  whose  i-th 
row  is  the  vector  ^(x*).  Depending  on  the  nonlinear  transformation  <h(.)  the  feature  space 
can  by  high-dimensional,  even  infinite  dimensional  when  the  Gaussian  kernel  function  is 
used.  However,  in  practice  we  are  working  only  with  n  observations  and  we  have  to  restrict 
ourself  to  finding  the  solution  of  the  linear  regression  problem  in  the  span  of  the  points 
{<h(xj)}^^.  This  situation  is  analogous  to  the  case  when  the  input  data  matrix  X  has 
more  columns  than  rows;  i.e.  we  are  dealing  with  more  variables  than  measured  objects. 
This  motivated  Rannar  et  al.  (1994)  to  introduced  the  (input  space)  kernel  PLS  algorithm 
to  speed  up  the  computation  of  the  components  for  a  linear  PLS  model.  The  idea  is  to 
compute  the  components  from  the  {n  x  n)  XX^  matrix  rather  than  the  {N  x  N)  X'^X  matrix 
when  n  N.  The  same  approach  can  be  also  used  for  the  computation  of  the  principal 
components  (see  Wu  et  al.,  1997a)  or  the  nonlinear  version  (Scholkopf  et  al.,  1998). 

Now,  motivated  by  the  theory  of  RKHS  described  in  Section  2  we  derive  the  algorithm 
for  the  (nonlinear)  kernel  PLS  model.  From  the  previous  section  we  can  see  that  by  the 
connection  of  2.  and  3.  step  and  by  using  the  $  matrix  of  mapped  input  data  we  can 
modify  the  NIPALS-PLS  algorithm  into  the  form 

1.  randomly  initialize  u 

2.  t  =  t  ^  t/||t|| 

3.  c  =  Y'^t 

4.  u  =  Yc,  u  <—  u/||u|| 

5.  repeat  steps  2.-5.  until  convergence 

6.  deflate  matrices:  ^  ($  —  tt'^$)($  —  Y  <—  Y  —  tt'^Y  . 

Applying  the  so-called  “kernel  trick”;  i.e.  the  fact  that  4>(xj)^4>(xj)  =  iP(xj,Xj),  we  can  see 
that  represents  the  {n  x  n)  kernel  Gram  matrix  K  of  the  cross  dot  products  between 

all  mapped  input  data  points  {$(xj)}(L^.  Thus,  instead  of  an  explicit  nonlinear  mapping, 
the  kernel  function  can  be  used.  The  deflation  of  the  =  K  matrix  after  extraction  of 

the  t  component  is  now  given  by 

K  ^  (I  -  tt'^)K(I  -  tt'^)  =  K  -  tt'^K  -  Ktt^  -h  tt^Ktt^  ,  (9) 

where  I  is  an  n-dimensional  identity  matrix.  We  would  like  to  point  out  that  a  similar  kernel 
PLS  algorithm  can  be  also  derived  by  the  nonlinear  modification  of  the  (linear)  kernel  PLS 
algorithm  described  in  (Rannar  et  ah,  1994).  This  modification  leads  to  the  extraction  of 
the  t,u  components  from  the  KYY^  and  YY^  matrices,  however  this  approach  can  be 
more  fruitful  when  the  multivariate  kernel  PLS  model  is  assumed  {L  >  1)  as  compared  to 
the  NIPALS-PLS  algorithm  described  above. 
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Similarly  we  can  see  that  the  matrix  of  the  regression  coefficients  B  (8)  will  have  the 
form 

B  =  $^U(T'^KU)“^T'^Y  (10) 

and  to  make  prediction  on  training  data  we  can  write 

Y  =  $B  =  KU(T'^KU)“^T'^Y  =  TT^Y  ,  (11) 

where  the  last  equality  follows  from  the  fact  that  the  matrix  of  the  components  T  may  be 
expressed  as  T  =  $R  where  R  =  $'^U(T^KU)“^  (de  Jong,  1993,  Holland,  1988).  It  is 
important  to  stress  that  during  the  iterative  process  of  the  estimation  of  the  components 
"'^6  made  the  deflation  of  the  K  matrix  after  each  step.  Effectively  it  means  that 
T  7^  KU.  Thus,  for  predictions  made  on  testing  points  the  matrix  of  regression 

coefficients  (10)  have  to  be  used;  i.e. 

Yt  =  $tB  =  KtU(T^KU)”^T^Y  ,  (12) 

where  is  the  matrix  of  the  mapped  testing  points  and  consequently  Kt  is  the  (nt  x  n) 
“test”  matrix  whose  elements  are  Kjj  =  K(xj,Xj)  where  and  are  the 

testing  and  training  points,  respectively. 

At  the  beginning  of  the  previous  section  we  assumed  a  centralized  PLS  regression  prob¬ 
lem.  To  centralize  the  mapped  data  in  a  feature  space  T  we  can  simply  applied  the  following 
procedures  (Scholkopf  et  ah,  1998,  Wu  et  ah,  1997b) 

K  =  (I--1„0k(I--1„0  (13) 

n  n 

Kt  =  (K*  -  iln,l^K)(I  -  ,  (14) 

where  I  is  again  an  n-dimensional  identity  matrix  and  In,  Int  represent  the  vectors  whose 
elements  are  ones,  with  length  n  and  nt,  respectively. 

In  conclusion  we  would  like  to  make  several  remarks  about  the  interpretation  of  the 
kernel  PLS  model.  For  simplicity  we  will  consider  the  univariate  kernel  PLS  regression  case 
(L  =  1)  and  we  denote  the  (n  x  1)  vector  d  =  U(T'^KU)“^T'^Y.  Now  we  can  represent 
the  solution  of  the  kernel  PLS  regression  as 


/(x,d)  =  ^djA:(x,Xj)  , 

i=l 

which  agrees  with  the  solution  of  the  regularized  formulation  of  regression  (2)  given  by  the 
Representer  theorem  in  Section  2.  Using  equation  (11)  we  may  also  interpret  the  kernel 
PLS  model  as  a  linear  regression  model  of  the  form  (for  more  detailed  interpretation  of 
linear  PLS  models  we  refer  the  reader  to  Garthwaite,  1994,  Hoskuldsson,  1988) 

/(x,  c)  =  Cltl(x)  -h  C2t2(x)  -h  .  .  .  -h  Cptp(x)  =  C^t(x)  , 

where  the  {L(x)}h^j^  are  the  projections  of  the  data  point  x  onto  the  extracted  p  components 
and  c  is  the  vector  of  weights  given  by  (7). 
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Figure  1:  First  three  principal  components  (PC)  extracted  by  kernel  PCA  (left)  and  components 
(C)  extracted  by  kernel  PLS  (right).  The  curves  represents  the  (principal)  component 
values.  Sine  function  is  shown  as  a  solid  line. 


4.1  Example 

In  this  section  we  would  like  to  demonstrate  some  of  the  properties  of  the  kernel  PLS  when 
applied  to  approximating  the  sinc{x)  function  defined  as 

sin\x 

x)  =  smclx)  =  — I — j— 

|x| 

We  generated  100  uniformly  spaced  samples  in  the  range  [—10, 10]  and  computed  the  cor¬ 
responding  values  of  the  sinc{.)  function  which  were  subsequently  centralized.  We  used  the 
Gaussian  kernel  function  with  width  equal  to  1.  In  Figure  1  the  first  3  principal  components 
(left)  computed  from  the  centralized  Gram  matrix  K  are  compared  with  the  components 
extracted  by  the  proposed  kernel  PLS  algorithm  (right).  We  can  see  the  qualitative  dif¬ 
ference  between  the  principal  components  and  components  extracted  by  kernel  PLS  where 
also  the  correlations  between  the  input  and  output  data  are  used. 

In  the  next  step  we  added  the  white  Gaussian  noise  with  standard  deviation  0.2  to  the 
outputs.  This  corresponds  to  the  ratio  between  the  standard  deviation  of  noise  and  signal 
equal  to  56%.  We  generated  an  additional  80  uniformly  spaced  testing  samples  from  the 
same  range  [—10, 10],  however  not  identical  to  the  previous  ones.  Using  these  data  points 
we  computed  noise-free  outputs.  The  results  obtained  on  training  and  testing  parts  of  the 
data  are  depicted  in  Figure  2.  We  can  see  that  although  the  kernel  PLS  method  fits  more 
precisely  noisy  training  data  by  the  appropriate  selection  of  the  components  we  can  avoid  the 
overfitting  effect  and  achieve  the  same  performance  on  the  testing  set  as  with  the  kernel  PGR 
method.  The  number  of  components  in  the  kernel  PLS  case  is  significantly  smaller.  Using 
the  number  of  components  on  which  minimum  test  error  occurred,  in  Figure  3,  we  plotted 
approximating  functions  computed  on  noisy  training  data.  We  achieved  qualitatively  similar 
results  in  the  case  when  the  Gaussian  noise  with  standard  deviation  equal  to  0.05  and  0.1 
was  added  to  the  outputs. 
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Figure  2:  Sine  function  approximation.  Dependence  of  the  training  and  testing  error  of  kernel 
PLS  (solid  line)  and  kernel  PCR  (dashed  line)  on  the  number  of  extracted  components. 
Gaussian  noise  with  standard  deviation  equal  to  0.2  was  added  to  the  training  data  set 
outputs.  Error  is  evaluated  in  terms  of  mean  squared  error  (MSE). 


Figure  3:  Comparison  of  kernel  PLS  (solid  line)  and  kernel  PCR  (dash-dotted  line)  on  noisy  sine 
function  approximation.  The  number  of  used  (principal)  components  was  selected  based 
on  plots  in  Figure  2  (right)  and  was  1  and  10  in  kernel  PLS  and  kernel  PCR,  respectively. 
Gaussian  noise  with  standard  deviation  equal  to  0.2  was  added  to  the  data  set  outputs 
(dots).  The  true  function  is  shown  dashed. 


106 


Kernel  Partial  Least  Squares  Regression  in  RKHS 


5.  Kernel  PCR  and  Kernel  RR  in  RKHS 

In  this  section,  we  briefly  describe  and  compare  another  two  kernel-based  regularized  least 
squares  models;  kernel  PCR  and  kernel  RR.  However,  we  start  with  the  kernel  PCA  method 
for  the  extraction  of  nonlinear  principal  components  used  in  the  kernel  PCR  model. 


5.1  Kernel  PCA 

The  PCA  problem  in  a  high-dimensional  feature  space  T  can  be  formulated  as  the  diago- 
nalization  of  an  n-sample  estimate  of  the  covariance  matrix 

1  JC  1 

'  ^  T  ^  ^ 

It  ' — '  n 

i=\ 

where  $  now  denotes  the  (n  x  M)  matrix  of  the  centered  nonlinear  mappings  of  the  input 
variables  G  .  The  diagonalization  represents  a  transformation  of  the  original 

data  to  new  coordinates  defined  by  orthogonal  eigenvectors  u.  We  have  to  find  eigenvalues 
A  >  0  and  non-zero  eigenvectors  u  G  satisfying  the  eigenvalue  equation 

Au  =  Cu  . 


c  =  -  y 


When  n  <C  M,  similar  to  linear  kernel  PCA  (see,  e.g.,  Wu  et  ah,  1997a),  we  may  transform 
this  eigenvalue  problem  to  the  problem  of  diagonalization  of  the  (n  x  n)  matrix  =  K; 
i.e.  solving  the  eigenvalue  problem  as  described  in  (Scholkopf  et  ah,  1998,  Sirovich,  1987) 

Ku  =  nAu  =  Au  .  (15) 

The  eigenvectors  are  then  given  by 

=  (nAfc)-V2$'rufc  =  , 

where  is  the  k-th  principal  component  extracted  by  solving  (15)  and  nXk  =  Xk  the 
corresponding  eigenvalue.  Finally,  we  can  compute  the  projection  of  <h(x)  onto  the  k-th 
nonlinear  principal  component  by 

n 

/3fc(x)  =  d>(x)'^u''  =  A“^/^  u^K{xi,  x)  .  (16) 

i=l 

Re-writing  this  projection  into  the  matrix  form  we  can  write  for  the  projection  of  training 
data  points  {xj}(T^ 

P  =  ^  kUA-^/2  ^  ua^/2  ^  (';L7) 

where  the  columns  of  U  are  created  by  the  eigenvectors  and  A  is  a  diagonal  matrix 

diag{Xi,  X2,  ■  ■  ■ ,  Xn)-  Similarly  for  the  projection  of  testing  points  we  have 

P^  =  ^  KtUA-^/2  ^ 

Note  that  the  assumption  of  the  centralized  nonlinear  mappings  is  again  transformed  to  the 
“centralization”  of  the  K  and  matrices  given  by  (13)  and  (14),  respectively. 


107 


Rosipal  and  Trejo 


5.2  Kernel  PCR 

Consider  the  standard  regression  model  in  feature  space  T 

y  =  $w  +  e,  (18) 

where  y  is  a  vector  of  n  observations  of  the  dependent  variable,  $  now  represents  an  (n  x  M) 
matrix  of  zero-mean  regressors  w  is  a  vector  of  regression  coefficients  and  e  is 

the  vector  of  error  terms  whose  elements  have  equal  variance  and  are  independent  of 
each  other.  is  proportional  to  the  sample  covariance  matrix  and  kernel  PCA  can  be 

performed  to  extract  its  M  eigenvalues  and  corresponding  eigenvectors 

(15).  Having  the  eigensystem  {Aj,  the  spectral  decomposition  (Jolliffe,  1986)  of 

has  the  form 

M 

=  ^Xiu\uY  . 

i=l 

The  projection  of  ‘h(x)  onto  the  k-th  nonlinear  principal  component  is  given  by  (16).  By 
projection  of  all  original  regressors  onto  the  principal  components  we  can  rewrite  (18)  as 

y  =  Bv-he,  (19) 

where  B  =  is  now  an  (n  x  M)  matrix  of  transformed  regressors  and  U  is  an  (M  x  M) 
matrix  whose  /c-th  column  is  the  eigenvector  u^.  The  columns  of  the  matrix  B  are  now 
orthogonal  and  the  least  squares  estimate  of  the  coefficients  v  becomes 

V  =  (B^B)-iB^y  =  A'^B^y  , 

where  A  =  diag(Ai,  A2, . . . ,  Am)-  It  is  worth  noting  that  PCR,  as  well  as  other  biased  re¬ 
gression  techniques,  is  not  invariant  to  the  relative  scaling  of  the  original  regressors  (Frank 
and  Friedman,  1993).  However,  similar  to  OLS  regression,  the  solution  of  (19)  does  not 
depend  on  a  possibly  different  scaling  in  individual  eigendirections  used  in  the  kernel  PCA 
transformation.  Further,  the  results  obtained  using  all  principal  components  for  the  pro¬ 
jection  of  the  original  regressor  variables  for  (19)  is  equivalent  to  that  obtained  by  least 
squares  using  the  original  regressors.  In  fact  we  can  express  the  estimate  w  of  the  original 
model  (18)  as 

M 

w  =  Uv  =  U(B^B)-^B^y  =  = 

i=l 

and  its  corresponding  variance-covariance  matrix  (Jolliffe,  1986)  as 

M 

cor;(w)  =  =  \“^u*(u*)^  ,  (20) 

i=l 

where  we  used  the  fact  that  y  ~  AA($w,cj^I).  Similarly  to  PLS,  to  avoid  the  problem  of 
multicollinearity,  PCR  uses  only  some  of  the  principal  components.  It  is  clear  from  (20) 

4.  For  the  moment,  we  are  theoretically  assuming  that  n  >  M.  Otherwise  we  have  to  deal  with  a  singular 
case  (n  <  M)  allowing  us  to  extract  only  up  to  n  —  1  eigenvectors  corresponding  to  non-zero  eigenvalues. 
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that  the  influence  of  small  eigenvalues  can  significantly  increase  the  overall  variance  of  the 
estimate.  PCR  simply  deletes  the  principal  components  corresponding  to  small  values  of 
the  eigenvalues  Aj.  The  penalty  we  have  to  pay  for  the  decrease  in  variance  of  the  regression 
coefficient  estimate  is  bias  in  the  final  estimate.  However,  if  multicollinearity  is  a  serious 
problem  the  introduced  bias  can  have  a  less  significant  effect  than  a  high  variance  estimate. 
If  the  elements  of  v  corresponding  to  deleted  regressors  are  zero,  an  unbiased  estimate  is 
achieved  (Jolliffe,  1986). 

Using  the  first  p  nonlinear  principal  components  (16)  to  create  a  linear  model  based 
on  orthogonal  regressors  in  feature  space  J-  we  can  formulate  the  kernel  PCR  model  as 
(Rosipal  et  ah,  2000a,  2001) 

p  p  n  n 

/(x,a)  =  ^Ufc/3fc(x)  +  6  =  ^Vk^Xl^^^UiK{xi,x)  =  ^aiiP(xj,x)  +  b  , 

k=l  k=l  i=l  i=l 

where  {uj  =  6  is  a  bias  term.  Similar  to  kernel  PLS  and  kernel 

RR  (see  below)  we  can  assume  a  centralized  regression  model  leading  to  a  zero  bias  term  b. 

We  have  shown  that  by  removing  the  principal  components  whose  variances  are  very 
small  we  can  eliminate  large  variances  of  the  estimate  due  to  multicollinearities.  However, 
if  the  orthogonal  regressors  corresponding  to  those  principal  components  have  a  large  cor¬ 
relation  with  the  dependent  variable  y  such  deletion  is  undesirable  (experimentally  demon¬ 
strated  in  Rosipal  et  ah,  2000b).  There  are  several  different  strategies  for  selecting  the 
appropriate  orthogonal  regressors  for  the  final  model  (see  Jolliffe,  1986,  1982,  and  refer¬ 
ences  therein).  In  Section  6  we  discuss  approaches  used  in  our  experiments. 

5.3  Kernel  Ridge  Regression 

Kernel  RR  is  another  technique  to  deal  with  multicollinearity  by  assuming  the  linear  re¬ 
gression  model  (18)  whose  solution  is  now  achieved  by  minimizing 

n 

Rrr{w)  =  ^{Vi  -  /(Xi,  w))2  -F  ^||wf  ,  (21) 

i=l 

where  /(x,w)  =  w^d>(x)  and  ^  is  a  regularization  coefficient.  The  least  squares  estimate 
of  w  is 

w  =  , 

which  is  biased  but  has  lower  variance  compared  to  an  OLS  estimate.  To  make  the  connec¬ 
tion  to  the  kernel  PCR  case  we  express  the  estimate  w  in  the  eigensystem 

M 

w  =  ^(Ai 
i=\ 

and  corresponding  variance-covariance  matrix  as  (Jolliffe,  1986) 

M 

cov{w)  =  ^  Ai(Ai  -h  C)“^W(u*)^  . 

i=l 
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We  can  see,  that  in  contrast  to  kernel  PCR  (20),  the  variance  reduction  in  kernel  RR  is 
achieved  by  giving  less  weight  to  small  eigenvalue  principal  components  via  the  factor 

In  practice  we  usually  do  not  know  the  explicit  mapping  $(.)  or  its  computation  in  the 
high- dimensional  feature  space  T  may  be  numerically  intractable.  Using  the  dual  represen¬ 
tation  of  the  linear  RR  model,  Saunders  et  ah  (1998)  derived  a  formula  for  estimation  of  the 
weights  w  for  the  linear  RR  model  in  a  feature  space  i.e.  (nonlinear)  kernel  RR.  Again, 
using  the  fact  that  A(x,  y)  =  $(x)^^>(y)  we  can  express  the  final  kernel  RR  estimate  of 
(21)  in  the  dot  product  form  (Saunders  et  ah,  1998,  Cristianini  and  Shawe-Taylor,  2000) 

/(x)=c^k  =  y'^(K  +  ^/)-ik, 

where  K  is  again  an  (n  x  n)  Gram  matrix  and  k  is  the  vector  of  dot  products  of  a  new 
mapped  input  example  ‘h(x)  and  the  vectors  of  the  training  set;  kj  =  (<h(xj).<I>(x)).  It  is 
worth  noting  that  the  same  solution  to  the  RR  problem  in  the  feature  space  T  can  also 
be  derived  based  on  the  dual  representation  of  the  regularization  networks  minimizing  the 
regularized  risk  functional  (1)  using  the  quadratic  loss  function  V{yi,  /(xj))  =  (y*  —  /(x*))^ 
(Girosi  et  ah,  1993,  1995,  Haykin,  1999)  or  through  the  techniques  derived  from  Gaussian 
processes  (Williams,  1998,  Gristianini  and  Shawe-Taylor,  2000). 

In  this  paper  we  assume  centralized  kernel  RR  (Rosipal  et  ah,  2000b);  i.e.  we  assume 
the  sample  mean  of  the  mapped  data  ‘h(xj)  and  outputs  y*  to  be  zero.  The  centralization 
of  the  individual  mapped  data  points  is  again  accomplished  by  the  “centralization”  of  K 
and  k  given  by  the  equations  (13)  and  (14),  respectively. 

6.  Model  Selection 

To  determine  unknown  parameters  in  all  regression  models,  cross  validation  (GV)  techniques 
were  used  (Stone,  1974).  While  in  kernel  RR,  a  regularization  coefficient  and  parameters 
of  the  kernel  function  have  to  be  estimated,  in  kernel  PLS  and  kernel  PGR  it  is  mainly  the 
problem  of  appropriate  selection  of  (principal)  components. 

For  a  comparison  of  models  using  particular  values  of  estimated  parameters,  the  predic¬ 
tion  error  sum  of  squares  (PRESS)  statistic  was  used. 


PRESS  =  -  /(x,))2  , 

i=l 

where  /(x*)  represents  the  prediction  of  the  measured  response  y*.  PRESS  was  summed 
over  all  GV  subsets. 

6.1  Kernel  PLS 

In  kernel  PLS  the  number  of  components  gradually  increases  until  the  model  reaches  some 
optimal  dimension.  Por  example  we  can  use  GV  to  determine  the  adequacy  of  the  individual 
components  to  enter  the  final  model  (Wold,  1978)  or  use  GV  for  the  comparison  of  whole 
models  of  certain  dimensionality  1, 2, . . .  ,y.  In  our  study  we  used  the  second  approach  and 
the  validity  of  individual  models  was  compared  in  terms  of  PRESS. 
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6.2  Kernel  PCR 

The  situation  is  more  difficult  in  the  case  of  kernel  PCR  than  in  kernel  PLS  because  prin¬ 
cipal  components  are  extracted  solely  based  on  the  description  of  the  input  space  without 
using  any  existing  correlations  with  the  outputs.  The  influence  of  individual  principal  com¬ 
ponents  regressors  can  be  consequently  measured  by  the  t-test  for  regression  coefficients 
(Montgomery  and  Peck,  1992).  By  assuming  a  centralized  regression  model  (19)  for  which 
the  design  matrix  B  satisfies  =  I,  we  can  write  for  the  statistic  of  k-ih.  regressor 

t1  =  (/3^Y)^,  where  (3f,  represents  the  (n  x  1)  vector  of  the  projections  of  input  data  onto  the 
A:-th  principal  component.  The  condition  B^B  =  I  simply  means  sphering  of  the  projected 
data  which  can  be  achieved  on  the  training  data  by  taking  P  =  U  in  equation  (17). 

There  are  several  different  situations  that  can  occur  in  PCR.  First,  the  principal  direc¬ 
tions  with  large  eigenvalues  and  significant  values  of  should  always  be  used  in  the  final 
model.  The  principal  directions  with  high  eigenvalues  and  insignificant  values  of  should 
also  be  included  in  the  final  model  due  to  the  fact  that  a  significant  amount  of  variability 
of  the  input  data  can  be  lost.  The  principal  directions  with  low  eigenvalues  and  insignif¬ 
icant  values  of  should  always  be  deleted.  The  most  difficult  problems  arise  when  some 
of  the  directions  with  small  eigenvalues  have  a  significant  contribution  to  prediction.  This 
situation  on  two  data  sets  used  was  already  demonstrated  in  (Rosipal  et  ah,  2000b,  2001). 
Moreover,  in  Figure  4  we  also  give  one  of  the  examples  we  observed  on  the  used  data  sets. 
Comparing  the  left  and  right  graphs  we  can  see  that  some  of  the  small  eigenvalues  principal 
components  may  have  relatively  high  prediction  properties.  In  contrast  we  can  see  that 

values  of  some  high-eigenvalue  principal  components  indicate  their  low  contribution  to 
the  overall  prediction  abilities  of  the  regression  model.  For  further  discussion  on  the  topic 
of  principal  components  selection  we  refer  the  reader  to  (Jolliffe,  1986,  Stone  and  Brooks, 
1990). 

First,  we  would  like  to  stress  that  as  a  consequence  of  the  orthogonality  of  regressors  the 
individual  single  variable  models  have  an  independent  contribution  to  the  overall  regression 
model.  This  significantly  simplifies  the  selection  of  individual  regressors  and  in  our  study 
we  decided  on  the  following  model  selection  strategy.  We  were  iteratively  increasing  the 
number  of  large  eigenvalue  principal  components  entering  the  model  without  considering 
their  values  of  the  statistic.  The  criterion  employed  was  the  amount  of  described  variance. 
The  rest  of  the  principal  components  were  ordered  based  on  the  statistic.  As  with  kernel 
PLS,  CV  was  used  to  compare  the  whole  models  of  particular  dimension.  However,  in 
contrast  to  kernel  PLS  the  PRESS  statistics  were  used  to  select  the  final  model  over  all 
possible  arrangements  of  the  final  models;  i.e.  for  a  different,  fixed  number  of  principal 
components  with  large  eigenvalues  entering  the  final  model. 

7.  Results 

The  present  work  was  carried  out  with  two  types  of  kernels.  Gaussian  kernels  iP(x,  y)  = 

/  ||x  — y||^  N  _  _ 

d  where  d  determines  the  width  of  the  Gaussian  function.  The  Gaussian  kernel 
possesses  good  smoothness  properties  (suppression  of  the  higher  frequency  components) 
and  in  the  case  we  do  not  have  a  priori  knowledge  about  the  regression  problem  we  would 
prefer  a  smooth  estimate  (Girosi  et  ah,  1993,  Smola  et  ah,  1998).  The  polynomial  kernels 
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Figure  4:  Example  of  the  estimated  ordered  eigenvalues  (left)  and  statistic  (right)  of  the  regres¬ 
sors  created  by  the  projection  onto  corresponding  principal  components.  Vertical  dashed 
lines  indicate  a  number  of  principal  components  describing  95%  of  the  overall  variance  of 
the  data.  One  of  the  training  data  partitions  of  subject  D  from  the  regression  problem 
described  in  Section  7.4  was  used. 


itr(x,  y)  =  ((x.y)  -|- 1)“  of  different  orders  a  were  also  employed.  In  the  case  of  a  =  1  we  used 
the  /C(x,  y)  =  (x.y)  =  x^y  kernel  leading  to  the  construction  of  linear  regression  models. 
The  results  were  evaluated  in  terms  of  (coefficient  of  determination)  (Montgomery  and 
Peck,  1992)  or  in  terms  of  normalized  root  mean  squared  error  (NRMSE). 

7.1  Corn  Data 

This  data  set  consists  of  80  samples  of  corn  measured  on  3  different  near-infra-red  spectrom¬ 
eters  (in  our  study  spectra  from  instrument  m5  were  used)  and  is  electronically  available 
at  http://www.eigenvector.com/Data/Data_sets.html.  The  wavelength  range  is  1100- 
2498nm  at  2  nm  intervals  {N  =  700  channels).  The  moisture,  oil,  protein  and  starch  values 
represent  four  output  variables  (L  =  4).  As  the  first  principal  component  described  99% 
of  the  overall  variance  this  indicated  high  multicollinearity  among  the  input  variables.  We 
have  used  the  spectra  to  form  the  matrix  of  regressors  X,  however,  instead  of  modeling  the 
real  responses  we  generated  four  different  outputs  as  follows 

yi  =  exp(Cf)  2/2  =  exp(5^^^^) 

2/3  =  (^)^exp(^)  y4  =  0.32/1  -b  0.252/2  -  O.T/zs  ■ 

A  is  a  symmetric  matrix  with  off-diagonal  elements  set  to  0.8  and  diagonal  elements  set 
to  1.0.  m  and  mi  are  averages  of  {x^Xj}J£‘{  and  {x^A~^Xj}[£*{,  respectively.  The  first  60 
samples  were  used  to  create  a  training  data  set  and  the  remaining  20  samples  created  a 
testing  set.  To  the  outputs  computed  on  training  data  we  added  white  noise  with  normal 
distribution  and  with  different  levels  corresponding  to  ratios  of  the  standard  deviation  of 
the  noise  and  the  clean  output  variables.  We  denote  this  ratios  n/s.  For  each  noise  level  25 
different  training  sets  were  generated. 
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Leave-one-out  cross  validation  (LOO)  applied  on  the  training  partition  was  used  to 
select  desired  parameters  for  individual  kernel  regression  models.  In  the  case  of  kernel  PCR 
the  first  principal  component  was  always  included  into  the  regression  models  and  only  the 
first  30  principal  components  covering  almost  the  whole  data  variance  were  investigated. 

Both  multivariate  and  univariate  regression  models  were  studied.  Multivariate  kernel 
PLS  tries  to  find  components  that  are  good  predictors  for  all  response  variables.  In  the 
final  model  the  same  components  are  used  for  the  prediction  of  individual  responses.  These 
components  are  determined  based  on  the  PRESS  statistic  computed  as  the  summation  of 
squared  errors  over  all  response  variables.  Multivariate  kernel  PLS  might  be  advantageous 
especially  when  predicted  response  variables  are  highly  multicollinear.  We  observed  that 
both  approaches  lead  to  the  same  results  on  data  sets  with  a  smaller  noise  level,  however, 
for  noise  levels  equal  to  njs  =  60%  and  njs  =  90%  multivariate  kernel  PLS  provides 
superior  predictions  on  the  test  set.  Multivariate  RR  approach  was  discussed  by  Frank  and 
Friedman  (Frank  and  Friedman,  1993).  It  was  shown  that  if  the  original  response  variables 
are  correlated  it  might  be  profitable  to  assume  separate  RR  or  PCR  regression  models 
on  decorrelated  outputs  rather  than  on  the  original  responses.  However,  by  applying  this 
method  we  did  not  observe  an  improvement  on  test  set  predictions. 

The  goodness  of  prediction  of  the  univariate  kernel  RR,  kernel  PCR  and  kernel  PLS 
regression  models  on  the  test  set  are  summarized  in  Table  1.  We  may  observe  compara¬ 
ble  performance  of  all  three  kernel  regression  techniques  employed.  However,  the  results 
achieved  with  multivariate  kernel  PLS  regression  were  superior  to  the  kernel  PCA  and  ker¬ 
nel  RR  models  in  the  case  of  a  higher  noise  level  nj s  =  90%.  In  this  case  multivariate  kernel 
PLS  resulted  in  values  equal  to  0.95  and  0.93  for  the  prediction  of  yi  and  y2  response 
variables  on  the  test  set.  Increasing  the  noise  level  leads  to  the  selection  of  a  smaller  number 
of  (principal)  components.  Although  we  are  losing  the  possibility  of  finer  approximation  of 
the  function  those  components,  especially  in  the  case  of  kernel  PLS,  may  still  give  relatively 
good  performance  even  in  the  situation  where  there  is  a  higher  noise  level.  However,  we  have 
to  note  that  this  is  due  to  the  relatively  simple  nonlinear  dependencies  that  we  constructed 
by  (22).  In  the  nonlinear  case,  on  average,  univariate  kernel  PLS  uses  less  than  80%  of  the 
components  used  by  kernel  PCR.  However,  in  the  case  of  linear  regression  (i.e.  using  the 
polynomial  kernel  K(x,y)  =  (x.y))  the  number  of  used  components  is  approximately  the 
same. 

We  would  like  to  note  that  using  the  original  response  variables  (moisture,  oil,  protein, 
starch)  we  did  not  achieve  significantly  better  results  using  polynomial  kernels  of  higher 
orders  compared  to  linear  regression. 

7.2  Polymer  Data 

This  data  set  is  taken  from  a  polymer  test  plant  (Ungar,  1995).  There  are  10  input  variables 
(N  =  10),  measurements  of  controlled  variables  in  a  polymer  processing  plant  (tempera¬ 
tures,  feed  rates,  etc.),  and  4  output  variables  {L  =  4)  are  measures  of  the  output  of  that 
plant.  It  is  claimed  that  this  data  set  is  particularly  good  for  testing  the  robustness  of 
nonlinear  modeling  methods  to  irregularly  spaced  data. 

We  first  took  41  samples  as  training  and  the  remaining  20  samples  as  testing  data.  LOO 
applied  on  the  training  partition  was  used  to  select  the  desired  parameters  for  individual 
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Noise 

kernel  PLS 

kernel  PCR 

kernel  RR 

level 

2/1  2/2  2/3  2/4 

2/1  2/2  2/3  2/4 

2/1  2/2  2/3  2/4 

K{x,y)  =  (x 

:.y) 

15% 

.96 

.95 

.76 

.75 

.95 

.94 

.76 

.72 

.96 

.95 

.76 

.74 

30% 

.94 

.93 

.74 

.70 

.92 

.93 

.75 

.73 

.94 

.94 

.74 

.71 

60% 

.86 

.87 

.75 

.73 

.81 

.89 

.67 

.71 

.88 

.89 

.74 

.73 

90% 

.77 

.75 

.70 

.73 

.69 

.77 

.62 

.60 

.78 

.81 

.69 

.73 

K{x 

,y)  = 

((x.y)  + 1) 

2 

15% 

.98 

.99 

.96 

.97 

.98 

.97 

.93 

.94 

.98 

.98 

.96 

.97 

30% 

.96 

.96 

.91 

.94 

.96 

.94 

.87 

.88 

.97 

.97 

.91 

.94 

60% 

.87 

.89 

.77 

.85 

.85 

.88 

.71 

.76 

.90 

.92 

.84 

.85 

90% 

.73 

.77 

.70 

.79 

.70 

.78 

.55 

.70 

.78 

.84 

.80 

.82 

K{x 

,y)  = 

((x.y)  +  1) 

3 

15% 

.99 

.99 

.98 

.98 

.99 

.99 

.96 

.97 

.99 

.99 

.97 

.98 

30% 

.98 

.97 

.94 

.95 

.97 

.97 

.89 

.91 

.98 

.97 

.95 

.96 

60% 

.93 

.89 

.88 

.88 

.88 

.88 

.77 

.82 

.93 

.90 

.90 

.91 

90% 

.85 

.77 

.85 

.85 

.72 

.73 

.61 

.79 

.83 

.77 

.86 

.86 

Table  1:  Goodness  of  prediction  in  E?  terms.  The  values  correspond  to  an  average  of  25  different 
simulations.  Noise  level  is  represented  as  the  ratio  between  the  standard  deviation  of 
the  added  Gaussian  noise  and  the  standard  deviation  of  the  generated  response  variables 
2/i,J/2,2/3,J/4  (22). 


kernel  regression  models.  Polynomial  kernels  of  different  orders  were  used.  The  indication  of 
high  multicollinearity  (the  ratio  between  the  first  and  fourth  eigenvalues  of  the  covariance 
matrix  of  the  response  variables  equals  628)  among  the  response  variables  suggests  that 
assuming  a  multivariate  regression  approach  may  by  profitable  in  this  case. 

Table  2  compares  the  goodness  of  prediction  of  the  multivariate  kernel  PLS  on  the  test 
set  with  the  kernel  RR  and  kernel  PCR  regression  models  used  on  decorrelated  outputs. 
The  results  applying  univariate  regression  models  on  the  same  data  set  are  summarized  in 
Table  3. 

We  can  see  that  all  three  regression  models  achieved  comparable  results  when  the  best 
predictions  on  individual  response  variables  are  compared.  We  may  see  that  univariate 
kernel  PLS  provides  better  prediction  on  the  responses  yi  and  1/2  while  its  multivariate 
modification  seems  to  be  better  for  the  prediction  of  2/3  and  1/4.  Univariate  kernel  RR  and 
kernel  PCR  are  better  on  the  prediction  of  2/2,  2/3  and  2/4,  while  decorrelation  of  the  outputs 
may  improve  the  prediction  of  2/1  •  However,  we  have  to  say  that  these  results  may  also 
depend  on  the  model  selection  criterion  used.  Further  experiments  using  different  data  sets 
and  model  selection  criteria  will  have  to  be  employed  to  provide  a  more  valuable  comparison. 
Results  also  suggest  that  on  this  data  set  kernel  PLS  provides  the  most  consistent  results 
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over  the  range  of  different  orders  of  the  polynomial  kernel.  Simulations  with  Gaussian 
kernels  of  different  widths  did  not  lead  to  better  performance. 


Order 

a 

kernel  PLS 

kernel  PCR 

kernel  RR 

2/1 

2/2 

2/3 

2/4 

2/1 

2/2 

2/3 

2/4 

2/1 

2/2 

2/3 

2/4 

1 

.43 

.06 

.73 

.77 

.23 

.68 

.83 

.79 

.36 

.21 

.79 

.79 

2 

.90 

.30 

.89 

.88 

0* 

.0* 

.63 

.62 

.0* 

.0* 

.68 

.86 

3 

.84 

.57 

.90 

.90 

.89 

.33 

.60 

.68 

.86 

.0* 

.58 

.54 

4 

.79 

.69 

.87 

.87 

.84 

0* 

.74 

.74 

.79 

.30 

.85 

.83 

5 

.72 

.65 

.65 

.57 

.80 

.62 

.76 

.81 

.70 

.37 

.73 

.71 

Table  2:  Goodness  of  prediction  of  the  yi,  1/2, 2/3,  J/4  in  terms  of  R^.  Bold  numbers  indicate  the 
best  achieved  prediction  on  individual  response  variables.  Polynomial  kernels  of  different 
orders  a  were  used.  0*  represents  a  case  when  averaged  performance  of  the  model  is  worse 
than  assuming  a  linear  model  equal  to  the  mean  of  a  response  variable. 


Order 

a 

kernel  PLS 

kernel  PCR 

kernel  RR 

2/1 

2/2 

2/3 

2/4 

2/1 

2/2 

2/3 

2/4 

2/1 

2/2 

2/3 

2/4 

1 

.21 

.0* 

.71 

.78 

.39 

.0* 

.77 

.78 

.28 

.0* 

.76 

.76 

2 

.90 

.04 

.85 

.87 

.82 

.41 

.84 

.84 

.79 

.45 

.90 

.90 

3 

.91 

.59 

.85 

.86 

.87 

.63 

.55 

.81 

.52 

.0* 

.67 

.22 

4 

.93 

.75 

.68 

.66 

.81 

.58 

.88 

.88 

.68 

.71 

.78 

.75 

5 

.85 

.79 

.58 

.81 

.81 

.79 

.83 

.84 

.60 

.74 

.66 

.66 

Table  3:  Goodness  of  prediction  of  the  2/1, 2/2, 2/3, 2/4  in  terms  of  R^.  Bold  numbers  indicate  the 
best  achieved  prediction  on  individual  response  variables.  Polynomial  kernels  of  different 
orders  a  were  used.  0*  represents  a  case  when  averaged  performance  of  the  model  is  worse 
than  assuming  a  linear  model  equal  to  the  mean  of  a  response  variable. 


7.3  Chaotic  Mackey-Glass  Time-Series 

The  chaotic  Mackey-Glass  time-series  is  defined  by  the  differential  equation 


ds{t) 

dt 


-bis{t)  +  62 


s{t  -  t) 

1  +  s(t  — 


with  61  =  0.1,  62  =  0.2.  The  data  were  generated  with  r  =  17  and  using  a  second-order 
Runge-Kutta  method  with  a  step  size  0.1.  Training  data  is  from  t=200  to  t=3200  while  test 
data  is  in  the  range  t=  5000  to  5500.  To  this  generated  time-series  we  added  white  noise 
with  normal  distribution  and  with  different  levels  corresponding  to  ratios  of  the  standard 
deviation  of  the  noise  and  the  clean  Mackey-Glass  time- series. 

The  training  data  partitions  were  constructed  by  moving  a  “sliding  window”  over  the 
3000  training  samples  in  steps  of  250  samples.  This  window  had  a  size  of  500  samples.  The 
validation  set  was  then  created  by  using  the  following  250  data  points.  This  created  ten 
partitions  of  size  500/250  (training/validation)  samples. 
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All  regression  models  were  trained  to  predict  the  value  sampled  85  steps  ahead  (L  =  1) 
from  inputs  at  time  t,t  —  6,t  —  12,  t  —  18  {N  =  4). 

The  Gaussian  kernel  was  used.  We  estimated  the  variance  of  the  overall  clean  training 
set  and  based  on  this  estimate  =  0.05  the  CV  technique  was  used  to  find  the  optimal 
width  d  from  the  range  (0.01,0.2)  using  the  step  size  0.01.  A  fixed  test  set  of  size  500 
data  points  was  used  in  all  experiments.  The  performance  of  the  regression  models  to  make 
predictions  on  “clean”  test  set  of  500  data  points  was  evaluated  in  terms  of  NRMSE. 

The  results  achieved  using  the  individual  regression  models  are  summarized  in  Table  4. 
We  can  see  that  there  are  no  significant  differences  among  the  methods  employed.  However, 
comparing  kernel  PLS  and  kernel  PCR  we  can  observe  a  significant  reduction  in  the  number 
of  components  used  in  the  case  of  kernel  PLS  regression.  In  some  cases  kernel  PLS  uses  less 
than  10%  of  the  number  of  components  used  by  kernel  PCR. 

Increasing  the  value  of  d  leads  to  a  faster  decay  of  the  eigenvalues  (see,  e.g.,  Williamson 
et  ah,  1998)  and  to  the  potential  loss  of  the  fine  structure  due  to  a  smaller  number  of  nonlin¬ 
ear  principal  components  describing  the  same  percentage  of  all  the  data  variance.  Increasing 
levels  of  the  noise  has  the  tendency  to  increase  the  optimal  value  for  the  d  parameter  which 
coincides  with  the  intuitive  assumption  about  smearing  out  the  local  structure  (for  the  dis¬ 
cussion  on  this  topic  see  Rosipal  et  ah,  2001).  In  contrast  small  values  of  d  will  lead  to 
“memorizing”  of  the  training  data  structure.  Thus,  in  Figure  5  we  also  compared  the  results 
on  the  noisy  time  series  (n/s  =  22%)  and  their  dependence  on  the  width  d  of  the  Gaussian 
kernel.  Similar  behavior  was  observed  for  data  with  njs  =  11%.  We  may  observe  a  smaller 
range  of  the  d  values  on  which  kernel  PLS  and  kernel  PCR  achieves  the  optimal  results  on 
the  testing  set  compared  to  kernel  RR.  However,  the  results  also  suggest  a  smaller  variance 
in  the  case  of  latent  variable  projection  methods;  i.e.  kernel  PLS  and  kernel  PCR. 


Noise 

level 

kernel  PLS 

NRMSE  #  of  C 

kernel  PCR 

NRMSE  #  of  PC 

kernel  RR 

NRMSE 

n/s=0.0% 

0.048 

155 

0.046 

383 

0.044 

(0.031) 

(38) 

(0.030) 

(78) 

(0.027) 

n/s=ll% 

0.322 

7 

0.327 

79 

0.321 

(0.030) 

(2) 

(0.030) 

(35) 

(0.041) 

n/s=22% 

0.455 

6 

0.462 

48 

0.451 

(0.021) 

(2) 

(0.031) 

(24) 

(0.029) 

Table  4:  The  comparison  of  the  approximation  errors  (NRMSE)  of  prediction,  the  number  of  used 
components  (C)  and  principal  components  (PC).  The  values  represent  an  average  of  10 
simulations.  The  corresponding  standard  deviations  are  presented  in  parentheses. 


7.4  Human  Signal  Detection  Performance  Monitoring 

In  this  study  eight  male  Navy  technicians  experienced  in  the  operation  of  display  systems 
performed  a  signal  detection  task.  Event  related  potentials  (ERP)  and  performance  data 
from  an  earlier  study  (Trejo  and  Shensa,  1999,  Trejo  et  ah,  1995,  Koska  et  ah,  1997)  were 
used.  The  performance  of  the  subjects  was  measured  in  terms  of  PFl  measure  (L  =  1) 
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Figure  5:  Comparison  of  the  results  achieved  on  the  noisy  Mackey-Glass  {n/s=22%)  time  series 
with  the  kernel  PLS  (top),  kernel  PCR  (middle)  and  kernel  RR  (bottom)  methods.  Ten 
different  training  sets  of  size  500  data  points  were  used.  The  performance  for  different 
widths  (d)  of  the  Gaussian  kernel  is  compared  in  normalized  root  mean  squared  error 
(NRMSE)  terms.  The  error  bars  represent  the  standard  deviation  on  results  computed 
from  ten  different  runs. 


based  on  their  accuracy,  confidence  and  reaction  time  to  detect  relevant  stimuli.  For  details 
on  the  experimental  setting  see  (Rosipal  et  ah,  2001). 

The  results  achieved  on  individual  subjects  in  our  former  studies  informed  our  choice  of 
the  Gaussian  kernel.  For  each  individual  subject  we  split  the  data  into  10  different  55%  and 
45%  training  and  testing  partitions.  Eleven-fold  CV  to  estimate  desired  parameters  was 
applied  on  each  training  partition.  After  CV  a  final  model  was  tested  on  an  independent 
testing  partition.  This  was  repeated  10  times  for  each  training  and  testing  data  pair.  The 
validity  of  the  models  was  measured  in  terms  of  R^. 
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Table  5  summarizes  the  results  achieved  on  eight  subjects  (A  to  H),  using  kernel  PLS, 
kernel  PCR  and  kernel  RR  methods.  As  with  results  reported  on  the  Mackey-Glass  time 
series  prediction  we  can  not  observe  any  significant  differences  among  the  kernel  regression 
models.  The  number  of  components  used  in  the  case  of  kernel  PLS  is  on  average  10  times 
lower  when  compared  to  kernel  PCR. 


Subject 

kernel  PLS 

#  of  C 

kernel  PCR 

^  of  PC 

kernel  RR 

A 

0.841 

27.9 

0.841 

373.1 

0.841 

(891  ERP) 

(0.025) 

(16.3) 

(0.023) 

(30.9) 

(0.025) 

B 

0.883 

33.9 

0.882 

224.4 

0.883 

(592  ERP) 

(0.027) 

(12.3) 

(0.026) 

(32.2) 

(0.026) 

C 

0.741 

15.5 

0.740 

134.8 

0.747 

(417  ERP) 

(0.060) 

(4.1) 

(0.044) 

(17.9) 

(0.044) 

D 

0.870 

24.8 

0.870 

241.2 

0.874 

(702  ERP) 

(0.011) 

(6.0) 

(0.010) 

(50.3) 

(0.010) 

E 

0.942 

42.4 

0.941 

274.4 

0.943 

(734  ERP) 

(0.006) 

(20.1) 

(0.006) 

(42.2) 

(0.007) 

E 

0.884 

24.6 

0.875 

186.6 

0.886 

(614  ERP) 

(0.023) 

(9.9) 

(0.025) 

(61.4) 

(0.024) 

G 

0.895 

23.6 

0.893 

323.3 

0.895 

(868  ERP) 

(0.018) 

(13.5) 

(0.018) 

(53.2) 

(0.017) 

H 

0.827 

19.7 

0.825 

280.4 

0.827 

(776  ERP) 

(0.022) 

(7.0) 

(0.022) 

(49.0) 

(0.022) 

Table  5:  The  comparison  of  and  the  number  of  used  components  (C)  and  principal  components 
(PC),  respectively,  for  subjects  A  to  H.  The  values  represent  an  average  of  10  different 
simulations  and  the  corresponding  standard  deviations  are  presented  in  parentheses. 


8.  Discussion  and  Conclusions 

On  several  different  regression  tasks  we  compared  the  proposed  kernel  PLS  method  with 
kernel  PCR  and  kernel  RR  techniques.  We  show  that  kernel  PLS  provides  the  same  results 
as  kernel  PCR  and  kernel  RR.  However,  in  comparison  to  kernel  PCR,  the  kernel  PLS 
method  uses  a  much  smaller  number  of  qualitatively  different  components.  As  demonstrated 
in  Section  4.1,  using  the  existing  correlations  between  outputs  and  mapped  inputs,  kernel 
PLS  may  provide  components  which  follow  more  closely  the  investigated  nonlinear  function. 
However,  as  with  kernel  PCA,  the  interpretation  of  these  components  in  the  input  space 
may  be  difficult. 

There  exists  a  large  body  of  literature  comparing  standard  OLS  regression  with  PLS, 
PCR  and  RR  (see,  e.g.,  Frank  and  Friedman,  1993,  Stone  and  Brooks,  1990).  Assuming 
a  construction  of  regularized  linear  regression  models  in  a  RKHS  we  can  make  some  con¬ 
clusions  by  using  the  analogy  with  the  reported  observations.  First,  in  the  situation  where 
high  multicollinearity  among  regressors  exists  OLS  leads  to  the  unbiased  but  high  variance 
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estimate  of  regression  coefficients.  PLS,  PCR  and  RR  are  designed  to  shrink  the  solution  to 
the  regression  from  the  areas  of  the  low  data  spread  resulting  in  biased  but  lower  variance 
estimates.  Second,  there  exist  real  world  regression  problems  where  the  number  of  observed 
variables  N  significantly  exceeds  the  number  of  samples  (observations)  n — a  situation  quite 
common  in  chemometrics.  Moreover,  we  may  usually  also  observe  that  the  real  rank  of 
the  matrix  of  regressors  is  significantly  lower  than  n  and  N.  The  projection  of  the  original 
regressors  to  the  “real”  latent  variables  is  the  main  advantage  of  methods  such  as  PLS  or 
PCR.  This  is  also  similar  to  the  situation  where  the  input  variables  are  corrupted  by  a 
certain  amount  of  noise  (the  situation  with  noisy  Mackey-Glass  time  series  and  ERP  data 
sets).  By  the  projection  of  original  data  to  the  components  with  higher  eigenvalues  we 
may  usually  discard  the  noise  component  contained  in  the  original  data  (assuming  kernel 
PCR  as  discussed  in  Rosipal  et  ah,  2001).  We  hypothesize  that  both  situations  are  also 
quite  common  when  a  kernel-type  of  regression  is  used.  Usually  we  nonlinearly  transform 
the  original  data  to  the  high-dimensional  space  whose  dimension  M  is  in  many  cases  much 
higher  than  the  number  of  observations  M  ^  n.  Although,  assuming  high-dimensional 
feature  spaces,  we  cannot  diagnose  multicollinearity  by  the  inspection  of  the  sample  covari¬ 
ance  matrix,  the  indication  of  strong  multicollinearity  may  by  detected  from  a  large  ratio 
between  maximal  and  minimal  eigenvalues  of  the  covariance  matrix.  The  eigenvalues  can 
by  estimated  by  the  kernel  PCA  method.  Values  greater  than  100  usually  indicate  strong 
multicollinearity  among  regressors  (Montgomery  and  Peck,  1992).  In  many  cases,  in  our 
data  sets  we  observed  that  the  ratio  between  the  larger  and  smaller  eigenvalues  exceeded 
1000,  indicating  severe  multicollinearity.® 

The  proposed  kernel  PLS  uses  the  NIPALS  procedure  to  iteratively  estimate  the  desired 
components.  We  have  already  pointed  out  that  the  NIPALS  algorithm  is  very  similar 
to  the  power  method  and  as  with  this  method  was  found  to  be  very  robust  for  solving 
eigenvalue-eigenvector  related  problems  where  dominant  eigenvectors  are  calculated  one  at 
a  time.  The  rate  of  the  convergence  of  both  algorithms  is  given  by  the  ratio  of  two  largest 
eigenvalues  (Malthouse,  1995).  Both  the  NIPALS-PLS  procedure  described  in  Section  4 
and  the  deflation  procedure  (9)  used  after  the  extraction  of  the  individual  components  scale 
as  0{n^).  The  need  to  repeat  these  procedures  increases  the  computational  costs  in  direct 
proportion  to  the  number  of  desired  components.  However,  on  the  employed  data  sets  we 
have  demonstrated  that  the  best  results  are  achieved  with  the  number  of  components  p  <C  n. 
Moreover,  investigating  the  curves  of  the  PRESS  statistics  computed  on  validation  sets  we 
observed  that  we  usually  do  not  need  to  investigate  a  wide  spectrum  of  components  as  a 
rather  strong  increase  of  PRESS  occurs  after  extracting  the  optimal  number  of  components. 
Assuming  n  >  L,  the  procedure  (11)  for  the  estimation  of  the  desired  training  data  set 
output  values  scales  as  0{pnL).  Using  the  procedure  (12)  to  make  the  prediction  computed 
on  a  single  test  data  point,  the  complexity  scales  as  0{pv?  +p^),  where  the  second  term 
associated  with  the  inversion  of  the  {p  x  p)  matrix  may  be  neglected  in  the  case  that  p  <^n. 
Moreover,  in  this  procedure  we  do  not  even  need  to  invert  the  whole  (n  x  n)  Gram  matrix. 
In  fact  using  the  kernel  PLS  method  on  large  scale  regression  tasks  we  may  avoid  storing 
the  whole  Gram  matrix  K.  Re-computation  of  its  elements  may,  however,  significantly  slow 
down  the  whole  algorithm. 

5.  We  investigated  the  situation  by  using  the  Gaussian  kernel  with  different  width  parameters  and  polyno¬ 
mial  kernels  of  different  orders. 
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